[rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler#4276
Conversation
Pull request overview
This PR refactors rocprofiler-sdk’s HSA queue write interception and async signal handling to reduce per-dispatch overhead by batching per-packet state and introducing pooled/batched HSA signals.
Changes:
- Introduces `packet_data_t` and updates completion callbacks to operate on per-packet data rather than session-wide fields.
- Adds a pooled signal infrastructure (pool/pool_object) and rewires queue interception to allocate/reuse signals in batches.
- Adds a new HIP test binary (`hip-graph-bubbles`) intended to create many graph-based kernel dispatches.
Reviewed changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/hip-graph-bubbles.cpp | New test program that builds/launches a HIP graph repeatedly with roctx ranges. |
| projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/CMakeLists.txt | Build rules for the new hip-graph-bubbles test binary. |
| projects/rocprofiler-sdk/tests/bin/CMakeLists.txt | Adds hip-graph-bubbles subdirectory to the test build. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/tracing/fwd.hpp | Changes external correlation map type to a small_vector-backed container. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.hpp | Updates post_kernel_call signature to take packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.cpp | Threads packet_data_t.user_data through post-dispatch data iteration. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp | Updates session type name references for completion callback signatures. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp | Adapts to renamed session type and small_vector external correlation map. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp | Updates dispatch tracing APIs to use queue_info_session_t + packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.cpp | Moves dispatch callback inputs from session-wide to per-packet storage. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/signal.hpp | Adds signal_t wrapper used by pooled signal objects. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_info_session.hpp | Introduces packet_data_t and refactors session to hold a small_vector of packet data. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp | Updates async completion callback signature and adds pooled-signal APIs. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp | Core refactor: batching packet data, pooled signals, and async handler changes. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/memory_allocation.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/async_copy.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/CMakeLists.txt | Adds signal.hpp to installed/compiled HSA headers list. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/tests/core.cpp | Updates tests for renamed session type and new completed_cb signature. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.hpp | Plumbs packet_data_t into callback processing params. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.cpp | Reads dispatch info/user_data/external corr IDs from packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.hpp | Updates completed callback signature to include packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.cpp | Passes packet_data_t through to sample processing. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/core.cpp | Updates controller callback wiring for new completed callback signature. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/code_object/code_object.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/buffer.cpp | Optimizes get_buffer lookup from linear scan to direct indexing. |
| projects/rocprofiler-sdk/source/lib/common/utility.hpp | Generalizes get_val to work with containers providing find (incl. small_vector pairs). |
| projects/rocprofiler-sdk/source/lib/common/mpl.hpp | Extends pair detection trait to expose first_type / second_type. |
| projects/rocprofiler-sdk/source/lib/common/container/static_vector.hpp | Simplifies emplace_back assignment path. |
| projects/rocprofiler-sdk/source/lib/common/container/stable_vector.hpp | Initializes members to defaults to avoid uninitialized state. |
| projects/rocprofiler-sdk/source/lib/common/container/small_vector.hpp | Adds map-like helpers for small_vector-of-pairs (find, at, emplace). |
| projects/rocprofiler-sdk/source/lib/common/container/record_header_buffer.cpp | Uses memset to clear only the used header range; adds <cstring>. |
| projects/rocprofiler-sdk/source/lib/common/container/pool_object.hpp | New pooled object wrapper with acquire/release semantics. |
| projects/rocprofiler-sdk/source/lib/common/container/pool.hpp | New pool implementation for batched reusable objects (used for signals). |
| projects/rocprofiler-sdk/source/lib/common/container/CMakeLists.txt | Adds new pool headers to the common container header list. |
```cpp
if(auto* pool = get_signal_pool(); use_pool && pool && attribute == 0)
{
    auto& _signal = pool->acquire(construct_hsa_signal, 0, 0, nullptr, attribute);
    ROCP_FATAL_IF(!_signal.in_use()) << "Acquired signal from pool that is not in use";
    *signal = _signal.get().value;
    // ROCP_INFO << fmt::format("acquired signal {} from pool: hsa_signal_t{{.handle={}}}",
    //                          _signal.index(),
    //                          _signal.get().value.handle);
    get_core_table()->hsa_signal_store_screlease_fn(_signal.get().value, 1);
    return &_signal;
}
```
The pooled-signal path calls pool->acquire(construct_hsa_signal, ...), which creates a brand new HSA signal each time the pooled object is acquired. That overwrites the previous hsa_signal_t handle stored in the pool object without destroying it, effectively leaking HSA signals and defeating reuse.
For pooled signals, create the HSA signal once when the pool batch is constructed and only reset its value on reuse (e.g., via hsa_signal_store_*), or explicitly destroy the previous handle before re-creating it.
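The create-once / reset-on-reuse pattern suggested here can be sketched with a plain C++ stand-in (all names are hypothetical: `Handle` plays the role of `hsa_signal_t`, and `create_count` demonstrates that the expensive create path runs once per slot rather than once per acquire):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for hsa_signal_t.
struct Handle
{
    std::uint64_t value = 0;
};

class HandlePool
{
public:
    // Tracks how many times the expensive create path runs.
    static inline std::uint64_t create_count = 0;

    explicit HandlePool(std::size_t batch_size)
    {
        // Expensive creation happens exactly once per slot, at batch
        // construction time (analogous to hsa_signal_create).
        m_handles.resize(batch_size);
        for(auto& h : m_handles)
            h.value = ++create_count;
        for(std::size_t i = 0; i < batch_size; ++i)
            m_free.push_back(i);
    }

    Handle& acquire()
    {
        std::size_t idx = m_free.back();
        m_free.pop_back();
        // Reuse path: the stored handle is returned as-is; a real pool would
        // only reset its value here (analogous to hsa_signal_store_screlease),
        // never re-create it.
        return m_handles.at(idx);
    }

    void release(Handle& h)
    {
        // Return the slot index computed from the contiguous backing storage.
        m_free.push_back(static_cast<std::size_t>(&h - m_handles.data()));
    }

private:
    std::vector<Handle>      m_handles;
    std::vector<std::size_t> m_free;
};
```

With this shape, repeated acquire/release cycles never touch the create path again, which is the reuse behavior the batching design was after.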
```cpp
ROCP_INFO << fmt::format("WriteInterceptor called with pkt_count={}", pkt_count);

using callback_record_t = packet_data_t::callback_record_t;
using packet_vector_t   = common::container::small_vector<rocprofiler_packet, 512>;
```
ROCP_INFO logging inside WriteInterceptor will run on every intercepted queue write and can add significant overhead / log spam in performance-sensitive code paths.
Consider removing this log or demoting it to ROCP_TRACE / VLOG guarded by a debug flag.
```cpp
ROCP_INFO << fmt::format("released signal {}: hsa_signal_t{{.handle={}}}",
                         signal->index(),
                         signal->get().value.handle);
```
Queue::release_signal emits an ROCP_INFO log for every pooled signal release. In the new batching design this can be thousands of logs per interceptor call and may negate the intended performance improvement.
Consider removing this log, lowering it to ROCP_TRACE, or guarding it behind a verbose/debug option.
Suggested change:

```cpp
ROCP_TRACE << fmt::format("released signal {}: hsa_signal_t{{.handle={}}}",
                          signal->index(),
                          signal->get().value.handle);
```
```cpp
    if(get_signal_pool()) get_signal_pool()->report_reuse();
}
```
pool::report_reuse() logs at WARNING level and Queue::sync() calls it unconditionally. This will likely produce warnings during normal operation and can be noisy.
Consider lowering this to ROCP_TRACE/VLOG, or only reporting reuse under an explicit diagnostics flag.
```cpp
        queue_info_session.kernel_pkt.ext_amd_aql_pm4.completion_signal);
}
hsa::get_core_table()->hsa_signal_store_screlease_fn(packet.interrupt_signal, -1);
ROCP_FATAL << "Destroying interrupt signal";
```
ROCP_FATAL << "Destroying interrupt signal" will unconditionally terminate the process whenever an interrupt signal is present. This looks like leftover debug logging and prevents normal cleanup.
Switch this to a non-fatal log (or remove it) so the interrupt signal can be destroyed without aborting.
Suggested change:

```cpp
ROCP_DEBUG << "Destroying interrupt signal";
```
```cpp
template <typename Integral>
constexpr Integral
bit_extract(Integral x, int first, int last)
{
    static_assert(std::is_integral<Integral>::value, "Integral type required");

    auto&& bit_mask = [](int _first, int _last) {
        ROCP_FATAL_IF(!(_last >= _first)) << fmt::format(
            "[queue::bit_extract::bit_mask] -> invalid argument. last (={}) is not >= first (={})",
            _last,
            _first);

        size_t num_bits = _last - _first + 1;
        return ((num_bits >= sizeof(Integral) * 8) ? ~Integral{0}
                                                   /* num_bits exceed the size of Integral */
                                                   : ((Integral{1} << num_bits) - 1))
               << _first;
    };

    return (x >> first) & bit_mask(0, last - first);
}
```
bit_extract is declared constexpr but its body calls ROCP_FATAL_IF / fmt::format via the bit_mask lambda. That makes the function non-constexpr-suitable and can cause compilation errors.
Either drop the constexpr qualifier here, or rework this into a truly constexpr-safe implementation without runtime logging.
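One way to rework this — a sketch, not the PR's code — is to keep the mask computation pure so the function stays usable in constant expressions, and leave argument validation to the caller or a non-constexpr logging wrapper (`bit_mask_ce` / `bit_extract_ce` are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Pure constexpr mask: bits [first, last] inclusive, with the same
// "mask wider than the type saturates to all-ones" behavior as the lambda.
template <typename Integral>
constexpr Integral
bit_mask_ce(int first, int last)
{
    static_assert(std::is_integral<Integral>::value, "Integral type required");
    const std::size_t num_bits = static_cast<std::size_t>(last - first + 1);
    return ((num_bits >= sizeof(Integral) * 8)
                ? ~Integral{0}
                : static_cast<Integral>((Integral{1} << num_bits) - 1))
           << first;
}

// Fully constexpr-suitable: no logging, no fmt::format in the body.
template <typename Integral>
constexpr Integral
bit_extract_ce(Integral x, int first, int last)
{
    return (x >> first) & bit_mask_ce<Integral>(0, last - first);
}

// Usable in constant expressions, which the logging version is not:
static_assert(bit_extract_ce(0b110100u, 2, 4) == 0b101u, "bits [2,4] of 0b110100");
```

The `ROCP_FATAL_IF` precondition check would then live in a thin non-constexpr wrapper used on runtime paths.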
```cpp
bool expected = true;
auto val = m_in_use.compare_exchange_strong(expected, false);

if(m_pool) m_pool->release(m_index);
```
pool_object::release() calls m_pool->release(m_index) even when the compare_exchange_strong fails (i.e., the object wasn't actually in use). That can push the same index into the available queue multiple times and lead to multiple threads acquiring the same pooled object.
Only return the index to the pool when the state transition from in-use -> free succeeds.
Suggested change:

```cpp
if(val && m_pool) m_pool->release(m_index);
```
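The fix can be sketched with standard atomics (`Slot` / `SlotPool` are hypothetical stand-ins for `pool_object` / `pool`): only the thread that wins the in-use → free transition returns the index, so a double release cannot push the same slot twice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct SlotPool
{
    std::vector<std::size_t> available;
    void release(std::size_t idx) { available.push_back(idx); }
};

struct Slot
{
    std::atomic<bool> m_in_use{true};
    std::size_t       m_index = 0;
    SlotPool*         m_pool  = nullptr;

    void release()
    {
        bool expected = true;
        // compare_exchange_strong returns true only for the one thread that
        // flips in-use -> free; losers (including redundant second calls)
        // must not touch the free list.
        if(m_in_use.compare_exchange_strong(expected, false) && m_pool)
            m_pool->release(m_index);
    }
};
```

Without the `val` guard, the second `release()` call would enqueue `m_index` again and two threads could later acquire the same pooled object.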
```cpp
auto  _read_lk = std::shared_lock<std::shared_mutex>{m_available_mtx};
auto& _obj     = m_pool.at(_idx.value());
```
pool::acquire() reads from m_pool without taking m_pool_mtx (it only holds m_available_mtx). Another thread can enter the "add a new batch" path and mutate m_pool under m_pool_mtx, causing a data race / invalid reference.
Take a shared lock on m_pool_mtx when accessing m_pool (and similarly in release() when calling m_pool.at(idx)), so reads and batch growth are properly synchronized.
Suggested change:

```cpp
auto _read_lk = std::shared_lock<std::shared_mutex>{m_available_mtx};
// Protect read access to m_pool with a shared lock on m_pool_mtx
auto _pool_read_lk = std::shared_lock<std::shared_mutex>{m_pool_mtx};
auto& _obj = m_pool.at(_idx.value());
```
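The locking discipline suggested here can be sketched generically (`GrowablePool` and its members are hypothetical): every read of the backing container takes a shared lock on the same mutex that batch growth takes exclusively, so handed-out references cannot race with container mutation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <shared_mutex>

class GrowablePool
{
public:
    // Reader side: shared lock on the pool mutex while indexing the container.
    int& at(std::size_t idx)
    {
        std::shared_lock<std::shared_mutex> lk{m_pool_mtx};
        return m_pool.at(idx);
    }

    // Writer side ("add a new batch" path): exclusive lock while growing.
    std::size_t grow(std::size_t batch)
    {
        std::unique_lock<std::shared_mutex> lk{m_pool_mtx};
        for(std::size_t i = 0; i < batch; ++i)
            m_pool.push_back(0);
        return m_pool.size();
    }

private:
    // std::deque keeps references stable across growth, which is what makes
    // handing out references from at() viable at all; the lock only has to
    // protect the container bookkeeping, not the element storage.
    std::deque<int>   m_pool;
    std::shared_mutex m_pool_mtx;
};
```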
```cpp
_core_api.hsa_signal_wait_relaxed_fn(_active_kernels,
                                     HSA_SIGNAL_CONDITION_EQ,
                                     0,
                                     timeout_hint.count(),
                                     HSA_WAIT_STATE_BLOCKED);
```
Queue::sync() now waits with a 1s timeout and ignores the return value from hsa_signal_wait_relaxed_fn. If the wait times out, sync() will return while kernels are still active, and the destructor may proceed to destroy resources prematurely.
Use an infinite/very large timeout (as before) or loop until the signal reaches the expected value, and handle the timeout/error case explicitly.
Suggested change:

```cpp
while(true)
{
    const auto current_value = _core_api.hsa_signal_wait_relaxed_fn(
        _active_kernels,
        HSA_SIGNAL_CONDITION_EQ,
        0,
        timeout_hint.count(),
        HSA_WAIT_STATE_BLOCKED);
    // hsa_signal_wait_relaxed_fn returns the signal value. If it is zero,
    // the condition HSA_SIGNAL_CONDITION_EQ 0 has been satisfied and all
    // active kernels have completed.
    if(current_value == 0)
    {
        break;
    }
    // Timeout or early wakeup occurred while kernels are still active.
    // Log and continue waiting to avoid destroying resources prematurely.
    ROCP_WARNING << fmt::format(
        "Timeout while waiting for queue sync: {} kernels still active", current_value);
}
```
I pulled this PR into a clean local worktree and tried it against the same HIP graph kernel-trace repro cases we have been using for the queue/signal issue.

One build caveat first: on my ROCm 7.13 / TheRock venv, the PR head (

With that single compatibility patch added, I still could not get the PR branch to pass the HIP graph repro:
I put the exact compatibility patch and the two crash logs into a secret gist here: https://gist.github.com/powderluv/f65f4560fe338effd090fd7dd57d833d Files in the gist:
So at least on this setup, this alternative implementation is not yet passing the existing HIP graph test cases.
I pulled this into a clean workspace and iterated on top of the PR head locally. The updated branch is here:
Local commit stack on top of the PR branch:
What changed at a high level:
Validation on the HIP graph reproducer (
This is materially different from the original state I tested earlier on this machine, where the branch either failed to build on the ROCm 7.13 venv or segfaulted / failed to emit profiler output on the same HIP graph kernel-trace cases.
I added a local hotspot pass on the current Method:
The main result is that the async completion callback is not the dominant performance hotspot. At the first summary window:
Interpretation:
The next useful step is finer instrumentation inside
So the current evidence says: optimize enqueue-side setup first, not async callback execution.
Follow-up hotspot note from a second local instrumentation pass on the HIP graph repro. I split the enqueue-side
Takeaway:
One nuance: the raw
Follow-up after cleaning up the local diff and updating the comparison branch. I pushed a cleaned queue-only commit on top of
What changed in this cleanup:
Validated from a clean rebuild/stage in the venv-backed environment at:
Wide HIP graph kernel-trace reruns on the cleaned stage:
One caveat: the first fresh
That fault did not reproduce on the immediate rerun above, and the second wide case also passed. So the ready-queue throughput fix is on the branch now, but there is still some residual instability outside the queue ready-queue path that may need a separate follow-up.
Motivation
Rewrites how rocprofiler-sdk handles signal creation and async signal handlers in queue interception.
Technical Details
JIRA ID
Test Plan
Test Result
Submission Checklist